Back to NotebooksBack
DownloadOutlier Test_v1
Outlier Detection Methods - Mathematical Formulas
1. Interquartile Range (IQR) Method
Definition: $$IQR = Q_3 - Q_1$$
Where:
- $Q_1$ = First quartile (25th percentile)
- $Q_3$ = Third quartile (75th percentile)
Outlier Boundaries: $$\text{Lower Bound} = Q_1 - 1.5 \times IQR$$ $$\text{Upper Bound} = Q_3 + 1.5 \times IQR$$
Outlier Criterion: A data point $x$ is classified as an outlier if: $$x < Q_1 - 1.5 \times IQR \quad \text{or} \quad x > Q_3 + 1.5 \times IQR$$
Pass/Fail Criterion: $$\text{Outlier Percentage} = \frac{\text{Number of Outliers}}{\text{Total Data Points}} \times 100%$$
- Station PASSES if: Outlier Percentage < 10%
- Station FAILS if: Outlier Percentage ≥ 10%
Step-by-Step Outlier Detection Analysis
Overview
This analysis performs outlier detection on precipitation data from 56 rainfall stations using the Interquartile Range (IQR) method. The goal is to identify stations with excessive outliers that may indicate data quality issues.
Analysis Steps
Step 1: Data Loading and Inspection
- Load precipitation data from Excel file
- Check dataset dimensions and structure
- Preview the first few rows to understand the data format
- Expected format: Date column + 56 station columns with precipitation values
Step 2: Understanding the IQR Method
- Mathematical foundation defined using quartiles (Q1, Q3)
- Outlier boundaries: Q1 - 1.5×IQR (lower) and Q3 + 1.5×IQR (upper)
- Pass/Fail Criterion: Stations with <10% outliers PASS, ≥10% FAIL
Step 3: Outlier Detection Implementation
- Apply IQR method to each station's precipitation data
- Calculate outlier counts, percentages, and boundary values
- Classify stations as PASSED or FAILED based on 10% threshold
- Generate summary statistics for all stations
Step 4: Export Results
- Save outlier detection results to Excel file
- Include: station names, outlier counts, percentages, bounds, and pass/fail status
Step 5: Visualization
- Create boxplots for all 56 stations (split across 4 figures)
- Each plot shows: data distribution, outliers (points beyond whiskers), and outlier statistics
- Visual inspection helps identify patterns and validate numerical results
Expected Outcomes
- Identification of stations with reliable data quality (< 10% outliers)
- Detection of problematic stations requiring further investigation
- Visual and numerical evidence for data quality assessment
Python
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as pltPython
file_source = r"\Eligibility Test\Data\grids_precipitation_data.xlsx"Python
# Load the data
df = pd.read_excel(file_source)
# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names:\n{df.columns.tolist()}")
print(f"\nFirst few rows:\n{df.head()}")Python
# Function to detect outliers using IQR method
def detect_outliers_iqr(data):
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = (data < lower_bound) | (data > upper_bound)
return outliers, lower_bound, upper_bound
# Apply outlier detection for each station (columns 1 to 56)
outlier_summary = {}
for col in df.columns:
if col != 'Date':
outliers_iqr, lower, upper = detect_outliers_iqr(df[col].dropna())
outlier_summary[col] = {
'total_outliers_iqr': outliers_iqr.sum(),
'lower_bound': lower,
'upper_bound': upper,
'outlier_percentage': (outliers_iqr.sum() / len(df[col].dropna())) * 100
}
# Display outlier summary
outlier_df = pd.DataFrame(outlier_summary).T
print(f"\nOutlier Summary:\n{outlier_df}")
# Define acceptable threshold for outlier percentage
acceptable_threshold = 10.0 # 10% threshold
# Check which stations pass the outlier test
outlier_df['passed_test'] = outlier_df['outlier_percentage'] < acceptable_threshold
# Display stations that passed and failed
print(f"\nStations that PASSED the outlier test (< {acceptable_threshold}%):")
print(outlier_df[outlier_df['passed_test']].index.tolist())
print(f"\nStations that FAILED the outlier test (>= {acceptable_threshold}%):")
print(outlier_df[~outlier_df['passed_test']].index.tolist())
print(f"\nSummary: {outlier_df['passed_test'].sum()} out of {len(outlier_df)} stations passed the test")Python
# Save the outlier test results to Excel
output_file = r"\outlier_test_results.xlsx"
# Create a copy of outlier_df and reset index to include station names as a column
outlier_results = outlier_df.copy()
outlier_results.insert(0, 'Station', outlier_results.index)
# Save to Excel
outlier_results.to_excel(output_file, index=False)
print(f"Outlier test results saved to: {output_file}")Python
# Create boxplots for all stations with outlier information
# Split into 4 separate figures for better visibility
n_stations = len([col for col in df.columns if col != 'Date'])
stations_per_fig = 14 # 14 stations per figure (2 rows x 7 columns)
station_cols = [col for col in df.columns if col != 'Date']
for fig_num in range(4):
start_idx = fig_num * stations_per_fig
end_idx = min(start_idx + stations_per_fig, n_stations)
n_plots = end_idx - start_idx
n_rows = 2
n_cols = 7
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 8))
axes = axes.flatten()
# Plot stations for this figure
for i in range(n_plots):
col = station_cols[start_idx + i]
axes[i].boxplot(df[col].dropna(), vert=True)
outlier_pct = outlier_df.loc[col, 'outlier_percentage']
outlier_count = int(outlier_df.loc[col, 'total_outliers_iqr'])
axes[i].set_title(f'Station {col}\nOutliers: {outlier_count} ({outlier_pct:.2f}%)', fontsize=9)
axes[i].set_ylabel('Precipitation (mm)')
axes[i].grid(True, alpha=0.3)
# Hide unused subplots
for i in range(n_plots, n_rows * n_cols):
axes[i].axis('off')
plt.tight_layout()
plt.show()